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Abstract 

Background: Traditional strategies for selecting variables in high dimensional classification problems aim to find 
sets of maximally relevant variables able to explain the target variations. If these techniques may be effective in 
generalization accuracy they often do not reveal direct causes. The latter is essentially related to the fact that high 
correlation (or relevance) does not imply causation. In this study, we show how to efficiently incorporate causal 
information into gene selection by moving from a single-input single-output to a multiple-input multiple-output 
setting. 

Results: We show in synthetic case study that a better prioritization of causal variables can be obtained by 
considering a relevance score which incorporates a causal term. In addition we show, in a meta-analysis study of 
six publicly available breast cancer microarray datasets, that the improvement occurs also in terms of accuracy. The 
biological interpretation of the results confirms the potential of a causal approach to gene selection. 

Conclusions: Integrating causal information into gene selection algorithms is effective both in terms of prediction 
accuracy and biological interpretation. 



Background 

Supervised analysis of genomic datasets (gene expression 
microarray or comparative genomic hybridization array 
for instance) with a large number of features and a 
respectively small number of samples requires the adop- 
tion of either regularization or feature selection strate- 
gies [1]. The most common feature selection strategies 
select or rank the variables according to a relevance 
score. In ranking, for instance, the score of each variable 
is the univariate association with the target returned by 
a measure of relevance, like mutual information, correla- 
tion, or p-value. If on one hand the ranking is widely 
used for its simple implementation and its low complex- 
ity, on the other hand it suffers from well-known limita- 
tions. A drawback is that ranking relies on univariate 
terms and as such it cannot take into consideration 
higher-order interaction terms or redundancy between 
features [2]. Another limitation is that ranking techni- 
ques are not able to distinguish between causes and 
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effects. This is due to the fact that univariate correlation 
(or relevance) does not imply causation [3]. This pro- 
blem is not solved in multivariate feature selection 
approaches since their cost function typically takes into 
consideration accuracy but disregards causal aspects. 
Nowadays the importance of bringing causality into play 
when designing feature selection methods is more 
widely acknowledged in the bioinformatics and the 
machine learning communities [4,5]. This is typically 
the case in microarray classification, where the goal is, 
for example, to distinguish between tumor classes or 
predict the effects of therapies on the basis of gene 
expression profiles [6]. In these settings the number of 
input variables, represented by the number of gene 
probes, is huge (typically several thousands) while the 
number of samples, represented by the patients' tumors, 
is very limited (a few hundreds) making the selection of 
relevant genes a challenging task. Moreover the infer- 
ence of causal relationships between variables plays a 
major role in the context of genomic studies since more 
and more biologists and medical doctors expect data 
analysis to provide not only accurate prediction models 
(for prognostic purposes) but also insights into the 
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mechanisms associated with disease and appropriate 
therapeutic targets. 

It is well established that the detection of causal pat- 
terns cannot be carried out in a bivariate (single-input 
single-output) context and that at least a trivariate set- 
ting has to be considered [7]. This is put into evidence 
by the literature on graphical models where arc orienta- 
tion relies on notions of conditional independence 
(requiring at least three terms) [8] and by the work on 
information theoretic methods for network inference 
[9]. In particular this paper will focus on the notion of 
feature interaction, a three-way mutual information that 
differs from zero when group of attributes are comple- 
mentary [10]. The role of interaction in feature selection 
has already been discussed in the machine learning lit- 
erature. Jakulin proposed an heuristic based on interac- 
tion for selecting attributes within the naive Bayesian 
classifier [11]. Meyer et al. proposed a filter algorithm 
which relies on the maximization of an information the- 
oretic criterion, denoted Double Input Symmetrical 
Relevance (DISR), which implicitly takes into account 
the interaction, or complementarity between variables, 
in the choice of the features [12]. Watkinson et al. used 
a notion of synergy related to feature interaction to 
assign a score to a pair of genes and then measured the 
degree of confidence that one of the genes regulates the 
other [9]. A causal filter algorithm which computes 
interaction between inputs has been recently proposed 
in [5]. However it is unclear whether these techniques 
are capable of recovering the set of features that are 
both relevant and causal, in high-dimensional problems, 
such as in microarray analysis. 

The contributions of this paper can be summarized as 
follows. First we introduce a new causal filter based on 
the interaction information and we show how to esti- 
mate this quantity in a multiple-input multiple-output 
setting. Second we assess the capacity of such filter to 
prioritize causal variables by using a synthetic case 
study. Third we measure from an accuracy and a biolo- 
gical point of view the performance of such causal filter 
in a number of prognostic studies in breast cancer. We 
advocate that a multiple-input multiple-output approach 
is particularly relevant in clinical studies where it is 
common that more than a single target variable is col- 
lected. This is the case of prognostic studies of breast 
cancer patients where several clinical indices, including 
patients' tumor size and histological grade, are collected 
together with the survival of the patients and the gene 
expressions of their tumor. It is worth to note that, in 
spite of their availability, these additional phenotypes are 
usually not taken into consideration since statistical stu- 
dies focus on survival prediction and adopt single-out- 
put methods. 



This paper describes an original multiple-input multi- 
ple-output score which combines a conventional rele- 
vance term with a causal term. This additional term 
quantifies the causal role of the features and allows the 
prioritization of causal variables in the resulting ranking. 
We carried out a synthetic study, where the set of causal 
dependencies is known, which shows that causal vari- 
ables are highly ranked once this score is adopted. We 
performed a meta-analysis of six publicly available breast 
cancer microarray datasets to assess the improvement of 
using our causal relevance score in terms of accuracy 
over the conventional ranking. The related discussion 
shows also that it is possible to carry out a biological 
interpretation of the role of selected variables which 
allows to discriminate between potentially causal and 
relevant, yet non causal, features. The source code, doc- 
umentation and data are open-source and publicly avail- 
able from http://mlg.ulb.ac.be/software/ and http:// 
compbio.dfci.harvard.edu/pubs/mimocausal/. 

Methods 

Mutual information and interaction 

Let us consider a multiple-input multiple-output 
(MIMO) classification problem characterized by n input 
variables X = {x i} i = 1,..., n} and m targets Y = {y ; , / = 
1,..., m} where Xf e X is continuous and 
Yj £ yj = {Cji, . . . ,Cjc\. Let us denote y x as the primary 
target and the remaining m - 1 outputs as secondary 
targets. We make this distinction since, though we 
assume that the goal of classification is to predict y lf we 
want to take advantage of the causal information which 
can be extracted by multiple targets. We begin by 
reviewing some notions of information theory by con- 
sidering three random (boldface) variables, notably two 
inputs Xi, x 2 and the primary target y l4 The mutual 
information [13] between the continuous variables x x 
and x 2 is defined in terms of their probabilistic density 
functions p(xi), p(x 2 ) and p(xi, x 2 ) as 

/(xi;x 2 ) = j j p(xi, x 2 ) log ^ cbadxj = H(xi) - H( Xl |x 2 ) (1) 

where H is the entropy and the convention 

Olog^ = 0 is adopted. This quantity measures the 

amount of stochastic dependence between x 1 and x 2 
and is also called two-way interaction [11]. Note that, if 
Xi and x 2 are Gaussian distributed the following relation 
holds 

J(x 1 ;x 2 ) = -ilog(l- / o 2 ) (2) 
where p is the Pearson correlation coefficient. 
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Let us now consider the target y lt too. The condi- 
tional mutual information /(xij x 2 |yi) [13] between x 1 
and x 2 , once yi is given, is defined by 



/// 



^ m P{xi r x 2 \yi) A A A 

p[xi,x 2/ yi) log dx l dx 2 dyi 
p{x x \yi)p(x 2 yi) 



The conditional mutual information is null iff Xx and 
x 2 are conditionally independent given y x . The change 
of dependence between Xi and x 2 due to the knowledge 
of yi is measured by the three-way interaction informa- 
tion defined in [14] as 



I(x 1 ;x 2 ;y 1 )=r(x 1 ;y 1 )-Z(xi;y 1 |x2) = 



-H(xi,x 2 ) -H( Xl ;yJ -H(x 2 ;y 1 ) +H(x 1 ) +H(x 2 ) +H{ Yl ) +H(x 1 ,x 2 ,y 1 ) 



(3) 



This measure quantifies the amount of mutual depen- 
dence that cannot be explained by bivariate interactions. 
When it is different from zero, we say that x 1? x 2 and y 1 
three-interact. A non-zero interaction can be either 
negative, which denotes a synergy or complementarity 
between the variables, or positive, which indicates 
redundancy. Because of the symmetry of the H operator 
in (3), we have 



7(xi;x 2 ;y 1 ) = I{x l ;y l ) -/(x^yjx^ 
= J( x 2;yi) -/(x^yjxi) 
= J(xi;x 2 ) - J(xi;x 2 |y 1 ) 

By (4) we derive 

J(xi;y|x 2 ) = I(xi;y) - J(xi;x 2 |y) 



(4) 



(5) 



Since the joint information of x 1 and x 2 to y 1 can be 
written as 

/((xi;x 2 );y) = J(x 2 ;y) + Z(xi;y|x 2 ) 

it follows that by adding 7(x 2 ; y) to both sides of (5) 
we obtain 



J( (xi ; x 2 ); y) - J(xi ; y) + 1 (x 2 ; y) - Ifa ; x 2 ; y) = J(xj ; y) + 7(x 2 ; y) + J(xi ; x 2 |y) - I(xi ; x 2 ) 



(6) 



Note that the above relationships hold also when 
either x 1 or x 2 are vectorial random variables. 

Feature selection, causality and interaction 

Consider a multiple-class classification problem where x 
e X <= R w is the n-variate input and y x e y is the pri- 
mary target variable. Let A = {1,..., n} be the set of 
indices of the n inputs. Let us formulate the feature 
selection problem as the problem of finding the subset 
X" of v > 0 variables such that 

X* = arg max J(X s ;y 1 ) = arg max s(X s ) (7 \ 

ScA:\S\=v ScA:\S\=v V > 

where the score s(Xs) of a subset X s of variables is 
given by the mutual information it brings to the target. 



In other words, for a given number v of variables the 
optimal feature set is the one that maximizes the infor- 
mation about the target. Note that this formulation of 
the feature selection problem, also known as Max- 
Dependency [12,15], is classifier-independent. 

If we want to carry out the maximization (7), both an 
estimation of I and a search strategy in the space of sub- 
sets of X are required. As far as the search is concerned, 
according to the Cover and Van Campenhout theorem 
[16], to be assured of finding the optimal feature set of 
size v, all feature subsets should be assessed. Given the 
infeasibility of exhaustive approaches for large n, we will 
consider here only forward selection search approaches. 
Forward selection starts with an empty set of variables 
and incrementally updates the solution by adding the 
variable that is expected to bring the best improvement 
(according to a given criterion). The hill-climbing search 
selects a subset of v <n variables in v steps by exploring 
only XXo ( n ~~ 1) configurations. For this reason the 
forward approach is commonly adopted in filter 
approaches for classification problems with high dimen- 
sionality [17,18]. 

If v = 1 the optimal set returned by (7) is composed of 
the most relevant variable, that is the one carrying the 
highest mutual information to y. For v > 1, we need to 
provide an incremental solution to (7) in order to 
obtain, given a set of d variables, the (d + l) th feature 
which maximizes the increase of the dependency 



arg max s((X s ,x fe )) 

x fe eX-X s 



(8) 



where (X 5 , x^) stands for the set of variables resulting 
from the union of X s and x^. Since for large d the term 
s((Xs,Xfc)) requires the computation of multivariate 
mutual information, its estimation is often prone to ill- 
conditioning and large variance. This led to the adop- 
tion of low variate approximations in literature, like the 
univariate approximation 



x d + i = ar § max 5{x k ) = arg max /(x/^y,) 

X fe GX-X S X fe GX-X S 



(9) 



which leads to a ranking of the variables according to 
their mutual information with the target. More 
advanced approaches rely on bivariate decompositions 
[12] like 



= arg max 

x fe GX-X s 



2 J2 5 (( Xi ' Xk ^ 



(10) 



x/GXs 



where s((xj,Xfc)) quantifies the amount of information 
that x/ and x^ contain jointly about y v 

However a feature selection procedure targeting the 
Max-Dependency is not able in general to discriminate 
between causal and non causal dependencies. For 
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instance in a selection procedure applied to a dataset 
derived from a causal process like the one in Figure 1, 
the effect x 4 could be more highly ranked than the 
direct causes x x and x 2 . 

Here we propose to modify the conventional score 
s(X) into a causal score s c (X) able to keep into consid- 
eration the causal information returned by the adoption 
of a multiple output configuration. This is made possible 
by integrating in the score an interaction term which is 
strictly related to the notion of causal dependency. 
Interaction and causal dependency 

This section aims to establish the link between informa- 
tion theory and causality. Causality is at the same time 
an essential and imprecise notion in scientific discovery. 
In order to avoid any ambiguity, here we adopt the 
formalism of causal Bayesian network which is a sound 
and convenient framework for reasoning about causality 
between random variables [8]. This means that all causal 
dependencies between variables are expressed by a 
directed acyclic graph where the existence of an 
oriented edge from a node x, to a node x ; means that x, 
directly causes x ; . In formal terms we assume that the 
Causal Markov condition, the Causal Faithfulness and 
the Causal Sufficiency conditions hold [4]. Several works 
in literature showed that the structure of a causal graph 
can, to some extent, be inferred from observational data. 
The vast majority of these works rely on statistical tests 
of conditional independence [19]. Here we present a 
way to reason about causality which do not use inde- 
pendence tests but estimate an information theory score 
to prioritize potential causes. 




Figure 1 Single-output case with different causal patterns: (i) 
common effect or explaining away effect configuration 
involving x 1# x 2 and y^, (ii) spouse configuration involving x 5 
and y-i', (iii) common cause configuration involving y v x 3/ x 4 ; 
and (iv) causal chain configuration involving x l7 y l7 x 4 . 



Let us consider a triplet made of two inputs x if x ; and 
one target y lt As discussed in [4] six possible configura- 
tions of directed acyclic graphs involving three variables 
can occur. One configuration is trivial and corresponds 
to a completely unconnected graph. One configuration 
corresponds to a single arrow chain (for example only x t 
and Xj are linked) and it is well known in literature that 
for a system of two variables the causal structure is not 
distinguishable. Another configuration corresponds to a 
fully connected graph and in this case the lack of inde- 
pendencies implies that the direction of the arrows can- 
not be determined. The remaining configurations can be 
illustrated and detected by studying the relationship [5] 
between the sign of /(x,-; x ; ; yi) and causal patterns of 
the triplet, like the ones sketched in Figure 1. 

A negative interaction /(x,; x ; ; y^) means that the 
knowledge of the value y 1 increases the amount of 
dependence between x, and x ; ; this situation occurs in 
the presence of a collider. According to the label of the 
collider we can have two cases: i) the common effect 
configuration (for example the pattern involving x 1} x 2 
and yi, also known as the explaining- aw ay effect) and ii) 
the spouse configuration (the pattern involving x 3 , x 5 
and y 1 in Figure 1 where x 3 is the common descendant 
of y 1 and x 5 ). This is a consequence of the fact that, if 
we assume Causal Faithfulness, the graph structure 
entails that the two parents are independent (null 
mutual information) but conditionally dependent (con- 
ditional mutual information bigger than zero). Note also 
that both configurations are characterized by the pre- 
sence of a collider. 

On the contrary a positive interaction 7(x,; x ; ; yi) 
between x t and x ; means that the knowledge of y x 
decreases the amount of dependence. This situation 
occur in two cases: i) the common cause configuration 
(for example, two dependent effects x 3 and x 4 become 
independent once the value of the common cause yi is 
known as illustrated in Figure 1) and ii) the causal 
chain configuration where one of the variables (let say, 
xi) is the cause and the other (let say, x 4 ) is the effect of 
yi. This is due to the fact that the graph entails the 
dependence between x t and x ; as well as their condi- 
tional independence (null conditional mutual 
information). 

So far we have considered a single-output configura- 
tion. However causal patterns can be better identified if 
we consider a multiple-output configuration, for 
instance the two output configuration sketched in Figure 
2. If y 1 and y 2 are two outputs representing different 
observations of the same phenomenon (for example a 
disease) we expect that the causal configurations con- 
cerning the first output appear also for the second one. 
This is a reasonable assumption in breast cancer clinical 
studies where the measured phenotypes (size and 
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Figure 2 Two-output case with different causal patterns: (i) 
common effect configuration involving x 3 , y n and y 2 ; (ii) spouse 
configuration involving y 2 and x 6 ; (iii) common cause 
configuration involving x 1# y n and y 2 ; and (iv) causal chain 
configuration involving x l7 y 2 and x 7 . 



histological grade of the tumor for instance) can be con- 
sidered as different manifestations of the state of the 
tumor. 

Let us consider for instance the inputs X! and x 2 and 
the two targets y 1 and y 2 : the common effect configura- 
tions between x 1 and x 2 and y 1 holds also for the triplet 
Xi and x 2 and y 2 . The same happens for the common 
cause pattern involving both the triplet x 3 , x 4 , y ± and x 3 , 
x 4 , y 2 . The presence of multiple outputs can therefore 
make more robust the identification of a causal pattern, 
especially in data configurations characterized by a very 
large number of variables. 

In the following we will take advantage of these con- 
siderations to design a causal filter able to extract from 
observed data causal dependencies between variables. 
The MIMO causal filter 

The link between causality and interaction discussed in 
the previous section suggests that, if we want to detect 
causality without estimating large variate dependencies, 
we may search for patterns like the one sketched in Fig- 
ure 3. This dependency pattern is characterized by two 
causal inputs and two outputs and can be detected 
when the following two conditions are satisfied: 

1 the interaction /(xi; x 2 ; y ± ) is negative 

2 the interaction I(xi; x 2 ; y 2 ) is negative 

In what follows we implement this idea into a MIMO 
causal filter where input variables belonging to causal 
patterns like the one in Figure 3 are prioritized. 

For the pair of inputs x x and x 2 and the pair of out- 
puts y 1 and y 2 , we define a structural score 




C(x 1/ x 2 ) 



--(7(x 1 ;x 2 ;y 1 )+/(x 1 ;x 2 ;y 2 )) 



(11) 



which is composed of two multiple-input interaction 
terms. The magnitude of this score depends on whether 
Xi and x 2 jointly play a joint causal role on y 1 and y 2 , or 
in other words, the pattern in Figure 3 is encountered. 
This means that the higher the term C{xi, x 2 ), the 
higher is the evidence that the pair xi, x 2 be a cause of 
Yi ad y 2 . This score plays a similar role to the score that 
is maximized in structural identification of Bayesian net- 
works [20]. If in that case the score measures the likeli- 
hood of the data for a given graph structure, here the 
quantity C{x 1} x 2 ) measures the likelihood of the data 
for a structural pattern where the pair x 1} x 2 has a cau- 
sal role. 

In the case of bivariate output (m = 2) we propose 
then a causal version s c of the univariate score s which 
accounts both for the relevance and the causal role of a 
pair of input variables Xi and x 2 

5 c ((xi,x 2 )) = J(xi; yi ) + I(x 2 ;y 1 ) + XC{x lf x 2 ) (12) 

where A > 0 stands for the degree of causality 
imposed to the selection. If we adopt the filter approxi- 
mation (10) the incremental formula takes the form 



- £> c ((x i; x fe )) = 



'( x fe;yi) + -t J2 c ( x < ;x '<) 



K^iYi) - -Jl (/(x^x^yj + /( Xl ;x fe ;y 2 )) 



(13) 



In other terms this formulation suggests to add at the 
(d + l) th step, among all the remaining variables, the 
one which has the better combination of relevance and 
causality, where the causal term is obtained by averaging 
over the selected variables and the considered outputs. 
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Note that in the case of m > 2 targets the structural 
score (11) is obtained by averaging the interaction terms 
over the m variables. 

Similarly to what is done in regularization approaches 
[21] where specific configurations (typically those with 
higher complexity) are penalized by adding a complexity 
term to the one measuring the error, the causality para- 
meter A in (13) is expected to penalize input variables 
with no causal role (positive interaction). Note that for 
A = 0 the selection rule (13) boils down to the rule (9). 
The following section will study the impact of the caus- 
ality term on the accuracy and the stability of a filter 
algorithm implementing the rule (13). 

Results 

In this section we perform two experiments to assess 
the role of the causation term in the feature selection 
process. The first one is based on a number of synthetic 
datasets generated by simulating a causal Bayesian net- 
work while the second relies on public microarray breast 
cancer datasets to assess the approach in a real data 
setting. 

Synthetic data 

This experiment focuses on the prioritization of causes 
in a set of classification tasks defined on the basis of 
simulated data generated by the causal structure 
depicted in Figure 4. Note that this causal structure 
aims to represent in a very simplified manner a stochas- 
tic dependency characterized by a number of indirect 
(nodes 1-3) and direct causes (nodes 4-8), a latent non 
measurable variable (node 9), one observable primary 
target (node 10), two secondary targets (nodes 11-12), a 
set of additional effects (nodes 13-29) and a number of 
independent and irrelevant variables (nodes 30-40). In 
order to set up an analogy with the real data experi- 
ments of the following subsection, we could make the 
assumption that the latent variable represents the cancer 
progression, the three targets denote a set of observable 
measures depending on the cancer state (patients' prog- 
nosis, size and histological grade of the tumor for 
instance), and that all other variables represent the 
expression of genes whose activity could play a causal 
role, be determined as an effect of the disease or be 
completely irrelevant. It is worth to note that also in the 
presence of a hidden variable the interaction between 
marginally independent causes given an effect is nega- 
tive. This is due to the fact that conditioning on the hid- 
den variable or on one of his children is equivalent in 
terms of d-separation between the variables [8] and con- 
sequently is equivalent in terms of the sign of the inter- 
action. We simulate a number of multivariate datasets 
from the causal structure in Figure 4 and for each of 
them we rank the inputs of the MIMO classification 



problem by using the conventional ranking approach 
based on mutual information (Equation (9)) and our 
novel approach based on causality (Equation (13)). The 
stochastic dependency between parents and descendants 
of the network is modeled by a linear regression where 
the parameters are uniformly sampled in [-2, 2] and the 
noise distribution is Gaussian with zero mean and stan- 
dard deviation a. We carry out a series of experiments, 
each characterized by 150 datasets and an increasing 
noise standard deviation ranging between 0.01 and 0.4. 
All the variables are continuous apart from the variables 
10, 11, and 12, which correspond to the targets y lf y 2 , 
and y 3 of the classification task and are discretized to 
two binary values. Note that all measures are centered 
and scaled in order to have a zero mean and unit stan- 
dard deviation; this allows for a better understanding of 
the impact of the noise amplitude on the ranking. 

The quality of our causal prioritization strategy is 
assessed by measuring the average ranking of the direct 
causes (nodes 4-8) and the percentage of time that the 
direct causes are ranked among the first 5 variables. 
These two measures (together with a 90% confidence 
interval) for different values of X are shown in Figure 5 
and 6 respectively. These plots show that by increasing 
the value of X, the average ranking position of direct 
causes decreases (direct causes are better prioritized) 
and that the percentage of correct selection increases 
(among the first ranked variables we find the direct 
causes with higher probability). The improvement 
occurs in a consistent manner for different values of the 
noise standard deviation though the detection of causal 
terms become less accurate as the noise increases. Note 
also that the very bad performance of the ranking (X = 
0) strategy (0% rate of correct selection) derives from 
the very large number of effects which tend to be 
ranked before the real causes. 

Real expression data 

The real data experiment consists of 6 public microarray 
datasets derived from breast cancer clinical studies 
(Table 1) in order to compare the generalization accu- 
racy of the selection returned by the conventional rank- 
ing approach based on mutual information (Equation 
(9)) with the accuracy of the selection returned by our 
novel approach based on causality (Equation (13)). 

All the microarray studies analyzed hereafter are char- 
acterized by the collection of gene expression data (the 
inputs X representing n = 13,091 unique genes), the sur- 
vival data (the primary target yi) and 2 additional clini- 
cal (secondary) variables about the state of the tumor, 
namely the histological grade and the tumor size. These 
clinical variables are well known by clinicians to be 
highly relevant for prognosis since large tumors of high 
grade are usually aggressive and lead to poor prognosis. 
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Figure 4 Bayesian causal network used for synthetic experiment. The green node 9 denotes the non observable variable. The three red 
nodes denote the targets of the multiple-output classification problem. The isolated node (30-40) represents a set of 1 1 independent variables. 




o.o 



0.1 



0.2 



0.3 



0.4 



noise std 

Figure 5 Synthetic data experiment: average ranking of direct 
causes for different values of X as a function of the noise 
standard deviation. Dotted lines are used to denote the 90% 
confidence interval estimated on the basis of 150 trials. 
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Figure 6 Synthetic data experiment: probability of selecting a 
direct cause among the first 5 ranked variables for different 
values of X as a function of the noise standard deviation. 

Dotted lines are used to denote the 90% confidence interval 
estimated on the basis of 150 trials. 
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Table 1 Affymetrix microarray datasets and related 
clinical study where the gene expression have been 
originally published 



Dataset 


Patients 


Reference 


UPP 


251 (110) 


[52] 


STK 


159 


[53] 


VDX 


344 


[54,55] 


UNT 


137 (92) 


[56] 


MAINZ 


200 


[57] 


TRANSBIG 


198 


[58] 



Duplicated patients between studies have been removed in two studies, UPP 
and UNT; the remaining unique patients are reported in brackets. All the 
datasets have been generated from Affymetrix technology and normalized 
using fRMA [51]. We consider for analysis the 13,091 unique genes common 
in all datasets. 



Each experiment was conducted in a meta-analytical 
[22] and cross-validation [23] framework, that is the set 
of variables are selected by relying on the samples of 
several datasets and the validation is performed on a set 
of samples not used for the selection. In order to adopt 
a classification framework, the survival of the patients 
was transformed in a binary class such as low or high 
risk of the patients given their clinical outcome at five 
years as in [24]. We conducted two sets of meta-analysis 
validation experiments to compare the conventional 
ranking approach (X = 0 case) and our causal version 
for different values of X: 

♦ Holdout: we carried out 100 training-and-test repe- 
titions where for each repetition the training set is 
composed of half of the samples of each dataset and 
the test is composed of the remaining ones. 

♦ Leave-one-dataset-out where for each dataset the 
features used for classification are selected without 
considering the patients of the dataset itself. Once 
the selection is over, 100 holdout repetitions are 
used to assess the generalization power of the 
selected set of features. 

All the mutual information terms are computed by 
using the Gaussian approximation (2). This allows the 
meta-analysis integration at the correlation level by 
means of the weighted estimation approach proposed by 
[22]. All the experiments were repeated for three sizes 
of the gene signature (number of selected features): v = 
20, 50, 100. 

The quality of the selection is represented by the 
accuracy of a Naive Bayes classifier measured by four 
different criteria: the Area Under the ROC curve (AUC), 
the Root Mean Squared Error (RMSE), the SAR 
(Squared error, Accuracy, and ROC score introduced by 
[25]) and the precision-recall F score measure [26]. 
Table 2 reports for the holdout experiment the value of 



the four performance criteria for different values of v 
and X. Table 3 refers to the leave-one-dataset-out 
experiments for v = 20, v = 50, and v = 100, respec- 
tively. Note that the W-L (Win-Loss) line reports the 
number of datasets for which the causal filter is signifi- 
cantly more (W) or less (L) accurate than the ranking 
filter according both to the McNemar test [27] (p-value 
< 0.05 adjusted for multiple testing by Holm's method 
[28]) and the Wilcoxon paired test on squared errors 
(p-value < 0.05 adjusted for multiple testing by Holm's 
method). 

Discussion 

In the previous section we reported the accuracy results 
of the traditional ranking approach and our novel 
method based on a causal relevance score. Here we dis- 
cuss the added value of our causal approach both from 
a quantitative and qualitative perspective. 

The performance measured in cross-validation sug- 
gests that the incorporation of a causal term leads to a 
significant improvement of classification accuracy. This 
improvement is observed for different validation config- 
urations and different sizes of the prognostic gene signa- 
ture. From these results we can conclude that (i) causal 
feature selection is interesting also for a prediction per- 
spective and (ii) relevant (prognostic) information is 
contained into secondary output variables (in our case 
tumor size and histological grade). Although the abso- 
lute improvement is only moderate (3% to 6% depend- 
ing on the validation configurations and performance 
estimates), the use of our causal ranking strategy in 
more sophisticated modeling approach for prognosis, 
such as in [29], may help develop more clinically rele- 
vant prognostic classifiers in breast cancer. 

The other advantage of our approach is that the intro- 
duction of a causality term leads to an interpretation of 
the causal role of the selected genes. We illustrate this 
characteristic in Figure 7 by comparing, through Gene 
Ontological (GO) terms, gene rankings with increasing 
degree of causality using a pre-ranked gene set enrich- 
ment analysis (GSEA) [30]. By quantifying how the cau- 
sal rank of genes diverges from the conventional one (X 
= 0) with respect to X we can identify the gene sets that 
are potential causes or effects of breast cancer. 

Genes that remains among the top ranked ones for 
increasing X can be considered as relevant (they contain 
predictive information about survival) and causal. Genes 
whose rank increases for increasing X are putative 
causes: they have less relevance than other genes (for 
example, those being direct effects) but they are poten- 
tially causal. These genes would have been missed by 
conventional ranking, where they would appear as false 
negatives if we interpret the outcome of conventional 
ranking in causal terms. Genes whose rank decreases for 
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Table 2 Holdout: accuracy criteria (to be maximized) for different numbers v of variables and different values of X 


v = 20 


X = 0 


X = 0.2 


X = 0.4 


X = 0.6 


X = 0.8 


X = 0.9 


X= 1 


X = 2 


AUC 


0.688 


0.688 


0.694 


0.699 


0.703 


0.704 


0.705 


0.707 


1-RMSE 


0.460 


0.466 


0.481 


0.493 


0.504 


0.510 


0.515 


0.542 


SAR 


0.559 


0.561 


0.569 


0.575 


0.580 


0.583 


0.585 


0.595 


F 


0.255 


0.254 


0.260 


0.262 


0.265 


0.265 


0.266 


0.274 


W-L 




1-0 


3-0 


5-0 


6-0 


5-0 


5-0 


5-0 


v= 50 


X = 0 


X = 0.2 


X = 0.4 


X = 0.6 


X = 0.8 


X = 0.9 


X = 1 


X = 2 


AUC 


0.693 


0.698 


0.702 


0.706 


0.709 


0.710 


0.711 


0.715 


1-RMSE 


0.451 


0.458 


0.465 


0.471 


0.477 


0.479 


0.482 


0.503 


SAR 


0.552 


0.556 


0.562 


0.567 


0.571 


0.572 


0.574 


0.583 


F 


0.263 


0.265 


0.268 


0.270 


0.272 


0.271 


0.273 


0.277 


W-L 




2-0 


3-0 


3-0 


2-0 


2-0 


3-0 


4-0 


v= 100 


X = 0 


X = 0.2 


X = 0.4 


X = 0.6 


X = 0.8 


X = 0.9 


X = 1 


X = 2 


AUC 


0.699 


0.704 


0.708 


0.711 


0.714 


0.715 


0.715 


0.716 


1-RMSE 


0.454 


0.457 


0.459 


0.463 


0.467 


0.470 


0.472 


0.487 


SAR 


0.545 


0.549 


0.553 


0.557 


0.561 


0.563 


0.564 


0.573 


F 


0.272 


0.271 


0.272 


0.274 


0.274 


0.274 


0.275 


0.284 



W-L 1-0 1-0 1-0 2-0 3-0 4-1 4-1 



AUC = Area Under the Curve; 1-RMSE = one minus Root Mean Squared Error; SAR = Squared error, Accuracy, and ROC; F = precision-recall; W-L = Win -Loss 
reporting the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the conventional ranking filter according both to 
the McNemar test (p-value < 0.05 adjusted for multiple testing by Holm's method) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for 
multiple testing by Holm's method). 



increasing X are putative effects in the sense that they 
are relevant but probably not causal. This set of genes 
could be erroneously considered as causal, and represent 
false positives if we interpret the outcome of conven- 
tional ranking in causal terms. 



Since genes are not acting in isolation but rather in 
pathways, we analyzed the gene rankings in terms of 
gene set enrichment. As described in [30], the normal- 
ized enrichment score (NES) computed in GSEA enables 
quantification of the strength of association of a gene set 



Table 3 Leave-one-dataset-out: accuracy criteria (to be maximized) for different numbers v of variables and different 
values of X 



v =20 


?t = 0 


X = 0.2 


X = OA 


X = 0.6 


X = 0.8 


X = 0.9 


X= 1 


X = 2 


AUC 


0.678 


0.674 


0.678 


0.680 


0.682 


0.682 


0.680 


0.669 


1-RMSE 


0.447 


0.448 


0.467 


0.469 


0.482 


0.528 


0.544 


0.556 


SAR 


0.553 


0.552 


0.560 


0.561 


0.566 


0.582 


0.586 


0.586 


F 


0.280 


0.275 


0.275 


0.281 


0.279 


0.283 


0.287 


0.276 


W-L 




1-1 


5-1 


2-0 


4-0 


5-0 


4-0 


4-0 


v = 50 


X = 0. 


X = 0.2 


X = 0.4 


X = 0.6 


X = 0.8 


X = 0.9 


X = 1 


X = 2 


AUC 


0.681 


0.687 


0.692 


0.693 


0.698 


0.700 


0.700 


0.693 


1-RMSE 


0.428 


0.438 


0.453 


0.457 


0.464 


0.473 


0.490 


0.516 


SAR 


0.542 


0.551 


0.559 


0.561 


0.565 


0.569 


0.576 


0.582 


F 


0.284 


0.284 


0.281 


0.281 


0.285 


0.291 


0.298 


0.303 


W-L 




3-0 


4-0 


5-1 


3-0 


5-0 


4-0 


6-0 


v= 100 


X = 0 


X = 0.2 


X = 0.4 


X = 0.6 


X = 0.8 


X = 0.9 


X = 1 


X = 2 


AUC 


0.687 


0.694 


0.704 


0.708 


0.711 


0.706 


0.708 


0.676 


1-RMSE 


0.430 


0.436 


0.449 


0.457 


0.463 


0.463 


0.476 


0.477 


SAR 


0.537 


0.545 


0.556 


0.562 


0.566 


0.565 


0.571 


0.561 


F 


0.290 


0.292 


0.294 


0.296 


0.299 


0.294 


0.304 


0.288 



W-L 1-0 4-0 6-0 4-0 4-0 5-0 5-1 



AUC = Area Under the Curve; 1-RMSE = one minus Root Mean Squared Error; SAR = Squared error, Accuracy, and ROC; F = precision-recall; W-L = Win -Loss 
reporting the number of datasets for which the causal filter is significantly more (W) or less (L) accurate than the conventional ranking filter according both to 
the McNemar test (p-value < 0.05 adjusted for multiple testing by Holm's method) and the Wilcoxon paired test on squared errors (p-value < 0.05 adjusted for 
multiple testing by Holm's method). 
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(GO term) with a phenotye of interest, here poor or 
good prognosis (survival). In more details, given a list of 
genes L ranked by their prognostic relevance and an a 
priori defined set of genes S (for example genes sharing 
the same GO category), the goal of GSEA is to deter- 
mine whether the members of S are randomly distribu- 
ted throughout L or primarily found at the top or 
bottom; gene sets associated with the prognosis pheno- 
type tend to show the latter distribution. NES reflects 
the degree to which a gene set S is overrepresented at 
the extremes (top or bottom) of the entire ranked list L. 
The score is calculated by walking down the list L, 
increasing a running-sum statistic when a gene in S is 
encountered and decreasing it when genes not in S are 
encountered. The magnitude of the increment depends 
on the statistic used to rank the genes in L. In our study 
the statistic of a gene is simply its rank (the most rele- 
vant genes have the largest ranks) and its sign depends 
on the association of its expression with survival: posi- 
tive sign if over-expression is associated with poor survi- 
val and inversely. The score is the maximum deviation 
from zero encountered in the "walk"; it corresponds to a 
weighted Kolmogorov-Smirnov-like statistic [30,31]. 
Finally the score is normalized for each gene set to 
account for the size of the gene set, yielding a NES. 

We computed NES for multiple genome-wide rank- 
ings generated with increasing values of X, and displayed 
in Figure 7 the score of the 3 most enriched GO terms 
which are identified as being potentially (A) both causes 
and effects, (B) causes, and (C) effects of breast tumori- 
genesis (GSEA results for all the GO terms are provided 
in Additional File 1, 2 and 3). The first group of GO 



terms that show similar enrichment scores indepen- 
dently of their level of causality are implicated in cell 
movement and division, cellular respiration and regula- 
tion of cell cycle (Figure 7A). The first GO term 
involves genes encoding for the Rho family of GTPases 
proteins that are among key regulators of actin and 
microtubule cytoskeleton [32] and are often over- 
expressed in human breast cancers [33]. Bromberg et al. 
showed that, when affected by RNF5, this family of pro- 
teins may cause dysregulation of cell proliferation to 
promote tumor progression [34]. The second GO term 
represents the co-enzyme metabolic process which 
includes proteins showed to be early indicators of breast 
cancer [35]; perturbation of these co-enzymes might 
cause cancers by compromising the structure of impor- 
tant enzyme complexes implicated in mitochondrial 
functions [35]. Genes involved in the third GO term 
"regulation cyclin-dependent protein kinase activity" are 
key players in cell cycle regulation and inhibition of 
such kinases proved to block proliferation of human 
breast cancer cells [36]. Moreover, Moore et al. recently 
highlighted the role of cyclin-dependent kinases as pro- 
gesterone activators that could give raise to tumors and 
sustain their progression in breast cancer [37]. 

Figure 7B displays the GO terms that are increasingly 
enriched with their degree of causality, involving genes 
that are putative causes of the tumorigenesis affecting 
patients' survival; these genes might have been missed 
by the conventional ranking approach (X = 0). Counter- 
intuitively, the three GO terms in this category are 
related to the immune system what is sought to be 
more an effect of the tumor growth as lymphocytes 



Microtubule cytoskeleton 
organization and biogenesis 



Coenzyme metabolic process 



Regulation of cyclin dependent 
protein kinase activity 



Cellular defense 
response 

Inflammatory response 
Defense response 



M phase of mitotic cycle 



M phase 



DNA replicaton 



NES 



A 



NES 



NES 



0 0.5 12 

Figure 7 Most enriched GO terms with respect to X according to a pre-ranked gene set enrichment analysis (GSEA): (A) GO terms 
enriched in the conventional ranking and having a high degree of causality for tumorigenesis; (B) GO terms increasingly enriched 
with respect to larger X, suggesting they are putative causes for tumorigenesis; (C) GO terms decreasingly enriched with respect to 
larger X, suggesting they are putative effects for tumorigenesis. The normalized enrichment score (NES) depends on the genome-ranking 
of the genes, which in turn depends on X. Larger the NES of a GO term, stronger the association of this gene set with survival; the sign of NES 
reflects the direction of association of the GO term with survival, a positive score meaning that over-expression of the genes implies worst 
survival and inversely. 
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strike cancer cells as they proliferate. However, several 
independent research groups showed that frequent 
usage of aspirin significantly decrease the long-term risk 
of cancer death by correcting immune system dysfunc- 
tion [38,39], findings that have been confirmed in breast 
cancer [40], what supports that the immune system 
might have a causal role in tumorigenesis. There is 
strong evidence of interplay between immune system 
and tumors since solid tumors are commonly infiltrated 
by immune cells; in contrast to infiltration of cells 
responsible for chronic inflammation, the presence of 
high numbers of lymphocytes, especially T cells, has 
been reported to be an indicator of good prognosis in 
many cancers [41], what concours with the sign of the 
enrichment (negative enrichment; Figure 7B). We and 
others have reported that gene expression signatures 
representing the immune response process were asso- 
ciated with a better prognosis in particular subtypes of 
breast cancer [29,42,43]. 

The last group of GO terms are less enriched when 
the degree of causality increases and the vast majority of 
the corresponding genes are related to cell-cycle and 
proliferation (Figure 7C). Cell-cycle and proliferation- 
related genes, such as for example Ki67, have been used 
for many decades to describe breast tumors: High levels 
of Ki67 have been correlated with worse prognosis and 
are also known to be associated with high tumor grade 
and negativity of estrogen receptor status [44,45]. We 
and others have shown that a quantitative measurement 
of proliferation genes using mRNA gene expression 
could provide an accurate assessment of prognosis of 
breast cancer patients [43,46,47]. We also have shown 
that only one of those genes, AURKA, which is signifi- 
cantly enriched in this case in the M phase GO term, 
was sufficient to recapitulate the prognostic performance 
of different prognostic signatures [48]. However the 
enrichment of these proliferation-related genes seems to 
be a downstream effect of the breast tumorigenesis 
instead of its cause. 

Our approach allows to identify biological processes 
that may be direct causes of cancer. These processes are 
likely to be missed by conventional methods. Given the 
promising performance of our approach, we plan to 
integrate our method in analytical frameworks combin- 
ing efficiently the available clinical data and a priori bio- 
logical knowledge, potentially retrieved from biomedical 
literature [49] or pathway database [50], in order to 
unravel gene sets or network of genes causal of cancer 
patients' survival. 

Conclusions 

It is well known in statistics that correlation does not 
imply causation or, in more general terms, that fea- 
tures that are relevant or strongly relevant for 



predicting a target are not necessarily direct causes. 
Direct effects are typical examples of variables that 
provide information about a target without having any 
causal role. In a data-driven approach to gene selection 
it is therefore more and more important to discrimi- 
nate not only between relevant and non-relevant vari- 
ables but also, within the subset of relevant variables, 
to discriminate between direct or indirect causes and 
effects. This paper proposes a computationally afford- 
able strategy to infer causal patterns that take advan- 
tage of multiple outputs. Experimental results in terms 
of accuracy and clinical interpretation show the added 
value deriving from the inclusion of a causal term into 
conventional ranking. 

Additional material 

c \ 

Additional file 1: Spreadsheet containing the normalized 
enrichment scores with respect to increasing X as computed by 
preranked GSEA (gsea_res_all.csv). 

Additional file 2: Archive containing the output files computed by 
the preranked GSEA for X L {0.1,0.2,0.3,0.4,0.5} (GSEA_MIMO_part1. 
zip). 

Additional file 3: Archive containing the output files computed by 
the preranked GSEA for X L {0.6,0.7,0.8,0.9,1.0,2.0} 
(GSEA_MIMO_part2.zip). 
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